fix: add short-name package for E1 judge env#71

Merged
intertwine merged 4 commits into main from fix/e1-judge-short-package
Mar 7, 2026

@intertwine
Owner

@intertwine intertwine commented Mar 6, 2026

Summary

  • add a dedicated short-name judge environment package at environments/sv-env-netlogs-judge
  • repoint configs/rl/e1_judge.toml to intertwine/sv-netlogs-judge
  • document the Prime 20-char label truncation bug that breaks sv-env-network-logs-judge

Why

Prime truncates long environment names when deriving Kubernetes labels. The original judge env name truncates to sv-env-network-logs-, which is invalid because it ends with -. A simple Hub rename was not enough because Prime also expects the installed distribution metadata to match the short name.

This PR publishes the judge variant as a real short-name package (sv-netlogs-judge) so hosted RL runs can both install and launch cleanly.
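
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, assuming Prime simply keeps the first 20 characters of the environment name (the PR only states that a 20-char truncation happens, not how); `truncate_label` and `is_valid_label_value` are hypothetical helpers, and the regex encodes the Kubernetes rule that a label value must start and end with an alphanumeric character.

```python
import re

# Assumption: Prime keeps the first 20 characters of the environment name.
PRIME_LABEL_LIMIT = 20

# Kubernetes label values: at most 63 characters, alphanumeric at both ends,
# with dashes, underscores, and dots allowed only in between.
_K8S_LABEL_VALUE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def truncate_label(env_name: str, limit: int = PRIME_LABEL_LIMIT) -> str:
    """Hypothetical stand-in for the truncation step described in this PR."""
    return env_name[:limit]


def is_valid_label_value(value: str) -> bool:
    """Check a string against the Kubernetes label-value syntax."""
    return len(value) <= 63 and _K8S_LABEL_VALUE.fullmatch(value) is not None


for name in ("sv-env-network-logs-judge", "sv-netlogs-judge"):
    label = truncate_label(name)
    print(f"{name!r} -> {label!r} valid={is_valid_label_value(label)}")
```

The long name is cut to `sv-env-network-logs-`, which fails validation because of the trailing dash, while the 16-character short name passes through unchanged.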

Test Plan

  • pytest environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py -q
  • pushed intertwine/sv-netlogs-judge v0.2.17 and verified Prime integration action succeeds
  • launched hosted judge run on Prime using OpenRouter-backed OpenAI-compatible env vars plus HF dataset access
  • confirmed live run: i5fvjomdr3yjn3pg6y7x3nu8

Notes

  • active W&B run: intertwine/svbench-e1/runs/bhsxeoiv
  • config now uses intertwine-ai/security-verifiers-e1 via E1_HF_REPO for hosted dataset loading

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a22d596920

Comment thread Makefile Outdated

# Optional Hub environment name override (defaults to pyproject.toml name)
- # Usage: make hub-deploy E=network-logs NAME=sv-env-network-logs-judge
+ # Usage: make hub-deploy E=network-logs NAME=sv-netlogs-judge

P2: Use the short-package path in hub-deploy usage example

The new usage comment points to make hub-deploy E=network-logs NAME=sv-netlogs-judge, but the deploy recipe builds from environments/sv-env-$(E), so that command still packages the old sv-env-network-logs distribution instead of sv-env-netlogs-judge. Following this instruction can reintroduce the package-metadata mismatch this commit is trying to avoid and lead to failed installs/runs for the short ID.


if logger and logger.enabled:
    logger.log_environment_init(
        environment_name="sv-env-network-logs-judge",

P2: Align logged environment name with the short alias

The short-name entrypoint wraps this loader, but log_environment_init still emits environment_name="sv-env-network-logs-judge" here before the wrapper renames the returned env object. In runs where rollout logging is enabled, sv-netlogs-judge executions will be recorded under the old ID, which can mix telemetry across env IDs and make experiment attribution unreliable.

@intertwine
Owner Author

@claude review

@claude

claude Bot commented Mar 7, 2026

Claude finished @intertwine's task in 2m 37s


Code Review

  • Gather context and read changed files
  • Review new package structure (sv-env-netlogs-judge)
  • Review entrypoint and loader logic
  • Review test coverage
  • Review Makefile and config changes
  • Summarize findings

Summary

The approach is sound: publishing a short-name package to work around Prime's 20-character Kubernetes label truncation bug is the right move, and the wrapper pattern in sv_netlogs_judge.py is clean. Two issues need fixes before merge, and a few minor points are worth addressing.


Issues

🔴 P1 — Makefile comment still points to the wrong source env

Makefile:39

# Usage: make hub-deploy E=network-logs NAME=sv-netlogs-judge

The deploy target does cd environments/sv-env-$(E), so E=network-logs builds and pushes from environments/sv-env-network-logs (the base classifier), not the new judge package. The NAME override only changes the Hub label sent to prime env push --name, not the package source. This is the exact package-metadata mismatch the PR is trying to prevent.

The correct invocation is:

# Usage: make hub-deploy E=netlogs-judge

No NAME override is needed because pyproject.toml already declares name = "sv-netlogs-judge".
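
To make the mismatch concrete, here is a rough model of how the two Makefile variables interact, written in Python rather than Make. It is only a sketch of the behaviour described above: `plan_hub_deploy` and the `PYPROJECT_NAMES` table are hypothetical, with the table standing in for the `name` field each directory's pyproject.toml declares.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical stand-in for the `name` declared in each pyproject.toml.
PYPROJECT_NAMES = {
    "environments/sv-env-network-logs": "sv-env-network-logs",
    "environments/sv-env-netlogs-judge": "sv-netlogs-judge",
}


def plan_hub_deploy(e: str, name: Optional[str] = None) -> dict:
    """Model which directory gets packaged and which Hub name gets pushed."""
    # E alone selects the source directory, mirroring `cd environments/sv-env-$(E)`.
    source_dir = str(PurePosixPath("environments") / f"sv-env-{e}")
    packaged = PYPROJECT_NAMES[source_dir]
    # NAME only overrides the Hub label; it never changes what is built.
    hub_name = name or packaged
    return {
        "source_dir": source_dir,
        "packaged": packaged,
        "hub_name": hub_name,
        "mismatch": hub_name != packaged,
    }


print(plan_hub_deploy("network-logs", name="sv-netlogs-judge"))
print(plan_hub_deploy("netlogs-judge"))
```

The first call pushes the base classifier's distribution under the short Hub name, so the metadata and the label disagree; the second builds from the judge package and needs no override.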


🟡 P2 — Telemetry is logged under the old ID before the wrapper renames it

sv_env_network_logs_judge.py:267-276

logger.log_environment_init(
    environment_name="sv-env-network-logs-judge",  # ← logged here
    ...
)

sv_netlogs_judge.py:19-21

env = _load_environment(**kwargs)
env.name = SHORT_ENV_ID       # ← renamed after the fact
env.env_id = SHORT_ENV_ID

When a logger is passed to the wrapper, log_environment_init fires inside _load_environment with "sv-env-network-logs-judge". By the time the wrapper patches env.name, the telemetry event has already been emitted. In a WandB/Weave run launched via sv-netlogs-judge, the init event will appear under the old ID, mixing telemetry across env IDs.

The simplest fix is to accept the env name as an optional parameter in load_environment:

# sv_env_network_logs_judge.py
def load_environment(
    ...
    env_name: str = "sv-env-network-logs-judge",
    ...
):
    ...
    logger.log_environment_init(environment_name=env_name, ...)
    return vf.SingleTurnEnv(name=env_name, ...)

# sv_netlogs_judge.py
def load_environment(**kwargs):
    kwargs.setdefault("env_name", SHORT_ENV_ID)
    return _load_environment(**kwargs)
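
The ordering problem and the suggested fix can be exercised without the real framework. The sketch below is self-contained and uses stand-ins throughout: `RecordingLogger` and `FakeEnv` are hypothetical replacements for the rollout logger and `vf.SingleTurnEnv`, and the two loader functions only imitate the shape of the code quoted above.

```python
from dataclasses import dataclass, field
from typing import List

SHORT_ENV_ID = "sv-netlogs-judge"
DEFAULT_ENV_NAME = "sv-env-network-logs-judge"


@dataclass
class RecordingLogger:
    """Stand-in logger that records every name it is asked to log."""
    enabled: bool = True
    init_names: List[str] = field(default_factory=list)

    def log_environment_init(self, environment_name: str) -> None:
        self.init_names.append(environment_name)


@dataclass
class FakeEnv:
    """Stand-in for the environment object returned by the loader."""
    name: str


def load_hardcoded(logger=None) -> FakeEnv:
    # Original shape: the ID is fixed at the point of logging.
    if logger and logger.enabled:
        logger.log_environment_init(environment_name=DEFAULT_ENV_NAME)
    return FakeEnv(name=DEFAULT_ENV_NAME)


def load_with_name(logger=None, env_name: str = DEFAULT_ENV_NAME) -> FakeEnv:
    # Suggested shape: the caller picks the ID before anything is logged.
    if logger and logger.enabled:
        logger.log_environment_init(environment_name=env_name)
    return FakeEnv(name=env_name)


def wrapper_rename_after(**kwargs) -> FakeEnv:
    env = load_hardcoded(**kwargs)
    env.name = SHORT_ENV_ID  # too late: the init event was already emitted
    return env


def wrapper_threaded(**kwargs) -> FakeEnv:
    kwargs.setdefault("env_name", SHORT_ENV_ID)
    return load_with_name(**kwargs)


before, after = RecordingLogger(), RecordingLogger()
env_before = wrapper_rename_after(logger=before)
env_after = wrapper_threaded(logger=after)
print(before.init_names, env_before.name)
print(after.init_names, env_after.name)
```

With the rename-after wrapper the logger records the long ID even though the returned object carries the short one; with the threaded wrapper both agree.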


Minor Points

Test file included in the wheel (pyproject.toml:43)
sv_netlogs_judge_test.py is listed under [tool.hatch.build] include. Tests shouldn't ship in production wheels: they inflate the artifact and can conflict with the Hub install. Compare with how sv-env-network-logs handles this.

PR body test-plan path typo
The test plan says pytest environments/sv-env-netlogs-judge/sv_env_network_logs_judge_test.py but the actual file is sv_netlogs_judge_test.py. Not a code issue, but worth correcting in the description.

env.env_id is non-standard (sv_netlogs_judge.py:21)
env.env_id = SHORT_ENV_ID dynamically adds an attribute that doesn't exist on vf.SingleTurnEnv. If the Verifiers framework doesn't read env_id anywhere, this is a no-op. Worth double-checking whether Prime reads env_id vs name to pick the entrypoint key — if name is sufficient, the env_id assignment can be dropped.

Duplicate sys.path manipulation (sv_env_network_logs_judge.py:20-53)
Two separate try/except import blocks both call sys.path.append in their fallback paths. If the weave_init import succeeds but sv_shared fails (unlikely in practice), the path gets set correctly. No bug, but this could be simplified by computing the path once at module level.


What's Working Well

  • The wrapper pattern in sv_netlogs_judge.py is minimal and correct — delegates everything to the main loader and patches env.name.
  • judge_sampling_args={"max_tokens": 16, "temperature": 0.0} is the right configuration for a binary yes/no signal.
  • Test coverage is solid: parser, prompt placeholders, all four judge response cases (yes, yes with trailing text, no, maybe), environment loading, and the short-alias test.
  • e1_judge.toml is clearly matched to e1.toml for controlled WP3c comparison — the comment block documenting the reward differences is very useful.
  • The research doc update correctly points Condition B to the new Hub name.

- wire env_name through the short judge loader for consistent telemetry
- rename the short-package implementation module to avoid collisions
- fix hub-deploy docs and exclude tests from the wheel
@intertwine
Owner Author

@claude re-review

@claude

claude Bot commented Mar 7, 2026

Claude finished @intertwine's task in 1m 46s


Re-review

  • Gather context
  • Read changed files
  • Check if previous issues were addressed
  • Post updated review

Summary

All P1 and P2 issues from the previous review have been fixed. The two remaining items are minor stale references in the research doc. This PR is ready to merge.


Previous Issues — All Resolved ✅

P1 — Makefile comment (Makefile:39)

# Usage: make hub-deploy E=netlogs-judge

Fixed. The comment now correctly points to E=netlogs-judge, which maps to environments/sv-env-netlogs-judge. No NAME override needed.

P2 — Telemetry logged under old ID
Fixed. sv_netlogs_judge_impl.py now accepts env_name: str = DEFAULT_ENV_NAME and passes it to both log_environment_init and vf.SingleTurnEnv(name=env_name, env_id=env_name, ...). The wrapper sets kwargs.setdefault("env_name", SHORT_ENV_ID) before delegating, so telemetry emits under "sv-netlogs-judge" from the start.

Test file in wheel
Fixed. pyproject.toml [tool.hatch.build] now includes only sv_netlogs_judge.py and sv_netlogs_judge_impl.py — the test file is correctly excluded.
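
One way to confirm this kind of exclusion is to inspect the built artifact directly, since a wheel is an ordinary zip archive. The following is a sketch only: `find_test_modules` and `make_fake_wheel` are hypothetical helpers, and the two in-memory archives are illustrative file lists rather than the real wheel contents.

```python
import io
import zipfile
from typing import List


def find_test_modules(wheel_bytes: bytes) -> List[str]:
    """Return archive members whose file names look like test modules."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        return [
            member
            for member in wheel.namelist()
            if member.endswith("_test.py")
            or member.rsplit("/", 1)[-1].startswith("test_")
        ]


def make_fake_wheel(members: List[str]) -> bytes:
    """Build an in-memory zip that stands in for a built wheel."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as wheel:
        for member in members:
            wheel.writestr(member, "")
    return buffer.getvalue()


# Illustrative file lists: one with the test module included, one without.
with_tests = make_fake_wheel(["sv_netlogs_judge.py", "sv_netlogs_judge_test.py"])
without_tests = make_fake_wheel(["sv_netlogs_judge.py", "sv_netlogs_judge_impl.py"])

print(find_test_modules(with_tests))
print(find_test_modules(without_tests))
```

Against a real build, the same function could be pointed at the bytes of the file produced in dist/ to assert that the list comes back empty.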

Test coverage for the fix
TestLoadEnvironment.test_short_alias_loads_environment (line 172) now explicitly verifies that env.name, env.env_id, and log_environment_init(environment_name=...) all use "sv-netlogs-judge" when called via the short-name wrapper. This is exactly the regression test the previous review was asking for.


Remaining Minor Items

Stale status checklist in research doc (reward_source_comparison.md:104–119)

Two items are out of date:

  1. The unchecked item "- [ ] Deploy judge environment to Hub": the PR body confirms v0.2.17 is live at intertwine/sv-netlogs-judge, so this should be checked off.
  2. The Files table still lists environments/sv-env-network-logs/sv_env_network_logs_judge.py as the judge variant file, but the implementation has moved to environments/sv-env-netlogs-judge/sv_netlogs_judge_impl.py.

Neither blocks merge, but both are worth tidying up.

Pre-existing: REPO_ROOT path resolution (sv_netlogs_judge_impl.py:26)

REPO_ROOT = str(Path(__file__).resolve().parents[2])

When installed as a wheel, __file__ is inside site-packages and parents[2] won't resolve to the repo root. This is the same pattern used in the other environments, so it's a pre-existing limitation rather than a regression. The installed package path takes precedence in the try block anyway, so it only matters if sv_shared isn't co-installed — which won't happen in a Prime environment. Not a blocker.
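
The depth arithmetic behind this can be checked with pure paths, with no filesystem access. Both paths below are hypothetical layouts chosen for illustration; only the number of directory levels matters.

```python
from pathlib import PurePosixPath

# Hypothetical layouts: a source checkout versus a site-packages install.
checkout = PurePosixPath(
    "/repo/environments/sv-env-netlogs-judge/sv_netlogs_judge_impl.py"
)
installed = PurePosixPath(
    "/venv/lib/python3.12/site-packages/sv_netlogs_judge_impl.py"
)

# parents[0] is the containing directory, so parents[2] is two levels above it.
print(checkout.parents[2])   # /repo
print(installed.parents[2])  # /venv/lib
```

In the checkout, two levels above the package directory is the repository root; in the installed layout the same index lands inside the virtualenv, which is why the fallback path is only meaningful when running from source.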


What's Working Well

The implementation is clean and the test suite is comprehensive. In particular, test_short_alias_loads_environment directly validates the core correctness property — that the wrapper correctly threads env_name through so that env.name, env.env_id, and telemetry are all consistent with "sv-netlogs-judge". The judge_sampling_args={"max_tokens": 16, "temperature": 0.0} configuration is appropriate for the binary yes/no signal.

@intertwine intertwine merged commit 3e66fd0 into main Mar 7, 2026
5 checks passed
@intertwine intertwine deleted the fix/e1-judge-short-package branch March 7, 2026 19:17